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ABSTRACT 

We present the design of a structured search engine which re- 
turns a multi-column table in response to a query consisting 
of keywords describing each of its columns. We answer such 
queries by exploiting the millions of tables on the Web be- 
cause these are much richer sources of structured knowledge 
than free-format text. However, a corpus of tables harvested 
from arbitrary HTML web pages presents huge challenges of 
diversity and redundancy not seen in centrally edited knowl- 
edge bases. We concentrate on one concrete task in this pa- 
per. Given a set of Web tables Ti,. . . ,T n , and a query Q 
with q sets of keywords Qi, . . . ,Q q , decide for each Ti if it 
is relevant to Q and if so, identify the mapping between the 
columns of Ti and query columns. We represent this task as 
a graphical model that jointly maps all tables by incorpo- 
rating diverse sources of clues spanning matches in different 
parts of the table, corpus-wide co-occurrence statistics, and 
content overlap across table columns. We define a novel 
query segmentation model for matching keywords to table 
columns, and a robust mechanism of exploiting content over- 
lap across table columns. We design efficient inference algo- 
rithms based on bipartite matching and constrained graph 
cuts to solve the joint labeling task. Experiments on a work- 
load of 59 queries over a 25 million web table corpus shows 
significant boost in accuracy over baseline IR methods. 

1. INTRODUCTION 

We consider the following structured Web search prob- 
lem. A user wants to assemble a multi column table rep- 
resenting either entities with an optional set of attributes, 
or multi-ary relationships among entities. He expresses his 
query as sets of keywords, one set for each column he wishes 
to see in the answer table. Examples include, single col- 
umn keyword queries like "Mountains in North America" 
to retrieve names of entities; two column keyword queries 
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like "Pain killer | Side effects" to retrieve instances of rela- 
tionship between two entities; three column keyword queries 
like "Cheese name | Country of origin | Milk source" to find 
entities along with values of two attributes. We tap the 
large number of organically created tables on the Web to 
answer such queries. In a recent 500 million pages Web 
crawl, we conservatively estimated that over 25 million ta- 
bles express structured information. Similar statistics have 
been reported elsewhere on other web crawls [4, 3, 7]. Each 
table contributes valuable facts about entities, their types, 
and relationships between them, and does so in a manner 
that is considerably less diverse and less noisy, compared to 
how facts are expressed in free-format text on the general 
Web. This makes web tables an extremely valuable resource 
for answering structured queries on the Web. 

The power of Web tables to answer ad hoc structured 
queries has largely been untapped. One exception is the 
Octopus system [2] that supports keyword queries such as 
"Mountains in North America" and returns a ranked list of 
tables that match those keywords and thereafter depends on 
multi-round user interactions to assemble an answer table. 
The system is targeted for topic queries, and is not suitable 
for ad hoc retrieval of relationships or specific attribute sets 
of an entity type. Multi-column keyword queries provide a 
unified mechanism to query for entities, attributes of enti- 
ties, and relationships, while being only a small extension of 
the familiar search box keyword queries. 

We present the design of an end to end system called 
WWT that takes as input a large corpus D of Web tables 
and in response to a query Q with q column keyword sets 
Qi, . . . , Q q returns as answer a single q column table. Unlike 
standard IR systems, our goal is not just to present a ranked 
list of matching Web tables from D, but to extract relevant 
columns from several matching web tables and consolidate 
them into a single structured table. An illustrative scenario 
is shown in Figure 1. The user is interested in compiling 
a table about explorers listing their names, nationality and 
areas explored. He submits three sets of column keywords 
Qi,Q2,Qi as a query as shown in the top left part of the 
figure. We show snippets of three of the several tables that 
match the query keywords. With each table snippet we also 
show some text, that we call context that was extracted from 
around the table in the web document that contained the 
table. WWT processes these tables to get a final consoli- 
dated table as shown as a snippet in the top right corner of 
the figure. 

A crucial challenge in converting the web tables into a 
consolidated answer table is deciding if a web table is rel- 
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Column Descriptors 



Answer Table 



Name of Explorers Nationality Areas Explored 



Qi 



Q 2 



| WebTableTl 



List of explorers - Wikipedia, the free 
encyclopedia 



Name 


Nationality 


Main areas 






explored 


Abel Tasman 


Dutch 


Oceania 


Vasco da Gama 


Portuguese 


Sea route to India 


Alexander Mackenzie 


British 


Canada 









Name of Explorers 


Nationality 


Areas Explored 


Vasco da Gama 


Portuguese 


Sea route to India 


Abel Tasman 


Dutch 


Oceania 


Christopher Columbus 




Caribbean 









| WebTable2~| 



This article lists the explorations in history. For 
the documentary 'Explorations, powered by 
Duracell', see Explorations (TV) 



Exploration 


Who (explorer) 


(Chronological order) 




Sea route to India 


Vasco da Gama 


Caribbean 


Christopher Columbus 


Oceania 


Abel Tasman 







Web Table 3 



Other Formal Reserves 1 .3 Forest Reserves 
under the Forestry Act 1 920 
All areas will be available for mineral 
exploration and mining 



Forest reserves 






ID 


Name 


Area 


7 


Shakespeare Hills 


2236 


g 


Plains Creek 


880 


13 


Welcome Swamp 


168 









Figure 1: An example column description query. 



evant, and if so, matching the columns of the table to the 
query columns. We call this the Column mapping task. In 
Figure 1, the task should label Table 1 as relevant and map 
its columns consecutively to Qi,Q2,Qa, label Table 2 as rel- 
evant and map column 1 to Qi and column 2 to Qi, and la- 
bel Table 3 as irrelevant. The Column mapping task is more 
challenging than relevance ranking entire tables because the 
amount of column specific information from both the query 
and tables is limited. From the query side, the only infor- 
mation is a set of keywords, and from the web table side the 
only obvious column-specific information is in the header of 
the table. Headers of tables on the Web are inadequate in 
several ways: most Web tables do not use the designated 
HTML tag for indicating headers (80% in our corpus) and 
are identified via imperfect automated methods, many ta- 
bles have no headers (18% in our corpus), header text is 
often uninformative (e.g. "Name"), many tables have mul- 
tiple headers but without any clearcut semantics (do they 
represent split phrases as in Table 1 of Figure 1, or addi- 
tional information as in Table 2, or a title as in Table 3?). 
The context text of a table is useful but it does not give 
column-specific information. How can matches in the con- 
text, common to all columns of a table, help us discriminate 
one column from another? We make a number of contribu- 
tions to address these challenges. 

Contributions. We propose a novel method to combine the 
clues in one column's header with the signals from the rest of 
the table (such as its context, and body) through a two-part 
query segmentation model. We show that the segmentation 
model is more precise than the IR practice of weighted sum 
of whole string matches along each field. Second, we show 
how to exploit the content overlap of a column with columns 
of other tables to help tables with poorly matched headers. 
We use an elegant formulation based on graphical models 
to jointly map all columns of all tables to the query table. 
We discuss a number of non-trivial choices in the design of 
the graphical model, including the choice of the variables, 
the node potentials that incorporate diverse clues from the 
table and the corpus, and, edge potentials that exploit con- 
tent overlap while being robust to the presence of several 
irrelevant web tables. Finally, we design efficient algorithms 



to solve the joint column mapping task. Experiments on a 
workload of 59 queries show a boost of Fl accuracy from 
65% using baseline IR method to 70% using our approach. 

Outline. We present an end to end description of WWT in 
Section 2. In Section 3 we describe the column mapper. In 
Section 4 we present algorithms to find the highest scoring 
mapping in the graphical model. Experiments, related work, 
and conclusions appear in Sections 5, 6 and 7 respectively. 

2. ARCHITECTURE 

Figure 2 shows the architecture of WWT. We first describe 
the main functions performed offline and then the process 
of answering a query online. 

2.1 Offline Processing 

We start with a Web Crawl from which we extract all data 
within the HTML table tags. However, not all of these are 
data tables because the table tag is often used for layout 
and for storing artifacts like forms, calendars, and lists. We 
depend on heuristics to retain tables that present relational 
information in a vertical format. On a 500 million pages 
Web crawl, we get an yield of 25 million data tables, which is 
roughly 10% of the table tags in the crawl. Similar statistics 
have been reported in [4, 3, 7]. A machine learning based 
approach [7] could have been more accurate but we did not 
have enough labeled data at the time. Instead, we decided 
to rely on query time relevance judgments to filter away 
non-data tables. Each extracted table is then processed and 
associated with the following two types of information: 

2.1.1 Headers 

Even though the HTML language has a designated tag 
( "th" ) for headers, only 20% of the 25 million Web tables 
used it. Most others depended on visual, font, and con- 
tent markers to indicate headers. After exploring several 
Web tables in an active learning framework, we designed 
an algorithm that marks headers as follows. The rows of 
a table are assumed to consist of zero or more title rows 
(for example Table 3 of Figure 1 has one title row), followed 
by zero or more header rows, followed by body rows. We 
scan rows sequentially from the top as long as we find rows 
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Figure 2: System architecture. 

different from most of the rows below it in terms of format- 
ting (use of bold face, italics, underline, capitalization, code, 
and header tags), layout (background color, CSS classes), or 
content (textual header with numeric body, number of char- 
acters). A 'different' row is labeled a title if all but the first 
column is non-empty. Else, we label that row as a header. 
All subsequent rows are labeled header as long as they are 
similar to the first header row and different from the rows 
below them. We stop as soon as a row fails this test. Over 
two thousand Web tables, we found that this heuristic failed 
only in five cases. In our corpus as per this method, 60% 
tables had one header row, 18% had no header, 17% had 
two header rows and the remaining 5% had more than two 
header rows. We could have used a less heuristic approach, 
say based on CRFs [14] but we found this approach to be 
very fast, and sufficiently accurate. 

2.7.2 Context 

The context is meant to include all the text in the parent 
document that tells us what the table is about. But this 
notion is much less objective than a header. So, we followed 
a policy of being generous in including text snippets in the 
context and, attaching a score to each text snippet for use 
during query processing. We depend on the DOM tree d of 
the parent document for extracting candidate snippets. Let 
T be the node in d containing the table. Any text node x 
that is a sibling of a node on the path from T to the root 
of d is added to the context. The score attached with x is 
calculated from two types of factors (1) the edge distance 
in tree d between x and T and whether x is a right or left 
sibling of the path nodes, and (2) the relative frequency in 
d of the format tags (bold, header, italics, underlined, etc) 
attached with x. We skip details of how exactly we do the 
combination due to lack of space. 

Finally, the extracted tables are indexed using Lucene and 
stored on disk. Each table is treated as a Lucene document 
with three text fields: header, context, and content. While 
indexing, we associate boost values of 2, 1.5, and 1 respec- 
tively with each of the three fields to control the relative 
importance of matches in the different fields for the column 
description queries. 

2.2 Query Time Processing 

Every user query Q consisting of say, q sets of keywords 
Qi, . . . , Q q is processed through the following stages. 



2.2.1 Find Candidate Web Tables 

We probe the index using the union of words in Qi , . . . , Q q . 
This gives us a set of tables T 1 = Ti, . . . T ni that match the 
query keywords in the header, context, or body. However, 
since many Web tables do not contain headers or useful con- 
text words, we use a second index probe to retrieve tables T 2 
based on content overlap with tables in T x . But, typically 
many tables in T 1 are irrelevant, and arbitrarily retrieving 
all tables with overlapping content in T 1 can introduce too 
much noise. We deploy the following conservative strategy. 
Invoke the column mapper (described in the next section) to 
find the top-two tables with very high relevance score from 
T . We may find no such tables for a query. We then add 
a random set 5* of ten rows from the confident tables and 
make a second index probe with the union of Q and 5*. In 
our experiments we found that the second stage index probe 
was used in 65% of the queries in our workload described in 
Section 5. For these, on an average 50% of all relevant source 
tables were obtained from the second stage. The fraction of 
relevant tables in the first stage was just 52% as compared 
to 70% in the second stage. This shows the usefulness of the 
second stage index probe. 

2.2.2 Column Mapping 

This step works on the candidate web tables from the in- 
dex probe Ti, . . . T n and decides for each table T; if it is rel- 
evant and if so marks for each query column Qt the column 
(if any) of T that maps to Qg. Also, with each relevance 
and column labeling decision it associates well calibrated 
probability scores since these are needed to fetch additional 
tables for the second index probe, and for ranking rows in 
the final table. The column mapper exploits a number of 
clues including matches in the header and context of a web 
table, the overlap in the content of two web tables, and the 
association of query keywords to a column's corpus in the 
entire corpus. In Section 3 we present how we combine these 
clues to solve the column mapping task. 

2.2.3 Consolidator and Ranker 

Given a set of tables output by the column mapper along 
with table-level relevance scores and column-level confidence 
scores, the consolidator merges relevant columns and rows 
into a single table. The main challenge in this task is decid- 
ing if two rows from different tables are duplicates, and we 
refer the reader to [9] for the approach we used to solve this. 
The ranker reorders the rows of the consolidated table so as 
to bring more relevant and highly supported rows on top. 
We skip details of our ranker since the focus of this paper is 
the column mapping task. 

3. DESIGNING THE COLUMN MAPPER 

We are given a query header Q with q sets of keywords 
Qi, . . . , Q q and a set of noisy web tables Ti, . . . T n . Each 
candidate web table has a variable number of columns and 
rows, may have zero or more headers, and contextual text 
extracted from around the table. Our goal in this step is 
to establish if a web table t is relevant to Q, and if so, 
label each column c of t with the query column to which it 
maps, or a special label na indicating no match to any query 
column. We call this the Column mapping task, with the 
understanding that it also subsumes the task of establishing 
if a table is relevant. 
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A simple method of solving this problem is as follows: 
First, establish the overall relevance of a table t to Q by 
suitably thresholding the sum of TF-IDF similarity of the 
keywords in Q to the context and header text of t. If t is 
found relevant, find the best matching of the query columns 
to the columns of t based on a thresholded similarity of each 
Qi to the header text of each column of t. 

This basic method has several shortcomings. First, be- 
cause the context text associated with a table is a very noisy 
descriptor of a table, the table-level relevance decision could 
get misled by unrelated verbosity in the context. Ideally, the 
table level relevance decisions should be made in conjunction 
with finding useful column mappings in the table. Second, 
since the headers in web tables are noisy and/or ambiguous, 
similarity scores with the header text alone is likely to be 
unreliable. We will show how to exploit other kinds of clues 
such as the context and body of the table, and other corpus 
wide co-occurrence statistics to augment the header-level in- 
formation. Finally, this method cannot handle tables with 
no headers, or tables with very little context or header text. 
In such a case we would like to exploit the content overlap of 
a column with columns of other tables. We will show in the 
experimental section that ad hoc ways of including header 
text from other overlapping table columns fails to yield good 
results. 

Instead of separately deciding for each web table whether 
it is relevant or not, and to which query label it maps us- 
ing independent similarity measures, we propose a more 
global approach to combine the diverse clues both within 
and across tables. We propose a unified approach based 
on graphical models that provides an elegant formulation 
to the task of jointly labeling all columns. We first give 
a brief overview of graphical models and then present our 
formulation. 

Graphical Models. A graphical model [10] expresses the 
joint distribution over a set of n random variables x = 
xi, . . . ,x n , where each xt belongs to a space of labels 
The model captures the dependencies between elements of 
x as a graph G, each a node in the graph, with a sparse set 
of edges as follows: 

We first identify small subsets of variables, called cliques, 
that are highly dependent on each other and form complete 
subgraphs of G. For each clique C, we then define poten- 
tial functions 9(C, xp) <-¥ that provides an un-normalized 
measure of compatibility among the labels xc assigned to 
variable subset C. Two special kinds of potentials are node 
potentials 9(i,Xi) defined on the label x% of a single node i 
and edge potentials 6(i,Xi,j,Xj) defined over edge in 
G and labels (xi,Xj) assigned to the two nodes it connects. 
The probability distribution is defined as 

Pr(xi,...,a;„) = — exp( ^ 8(C,x c )) 

CCcliqucs(G) 

where Z serves to normalize the product of clique potentials 
to a proper probability. A common method to define po- 
tentials is as a dot product between a model parameter and 
a feature vector. In case of node potential, we might write 
9(i,Xi) = wj f(i, Xi), where f : [1, M] — > k is the feature vec- 
tor and wi £ fe is the model vector where k is the number 
of features. Similarly, an edge potential would be defined 
as 6(i,Xi,j,Xj) = wJ{(i,Xi,j,Xj). The feature vector arc 
designed by the user whereas the model vectors wi , W2 are 



trained from labeled data. This feature-based framework 
provides a very convenient way to use a variety of intuitive 
clues without worrying about how to combine them numer- 
ically into a single objective. Once Pr(x) is defined and the 
model vectors are trained, the inference problem is to find 
argmax x ^2 C (0(C, xc)), the most likely joint assignment of 
labels to variables. When the graph has many large cliques, 
this problem is intractable and it is necessarily to design 
good approximation algorithms. 

We now show how we model the column mapping task 
as a graphical model by defining the random variables in 
Section 3.1, the node potentials in Section 3.2, the edge 
potentials in Section 3.3, other higher order potentials in 
Section 3.4, and the inference algorithm in Section 4. 

3.1 Variables 

Our task is to label each table as relevant or not, and label 
the columns of relevant tables with one of q query columns 
or na. Accordingly, a natural choice would be to create a 
graphical model where we have a variable t for each web ta- 
ble t £ {Ti, . . . ,T n } and a variable (tc) for each column c of 
table t. Each table variable is binary indicating if the table is 
marked relevant or not, and each column variable (tc) takes 
a label from the set {1, . . . , q} U {na}. However, this natural 
variable representation gives rise to an unnaturally compli- 
cated edge set to capture the interaction between a table 
variable, and all its columns — only when a table variable 
is labeled relevant, does it make sense to label the column 
variables. Also, to represent positive affinity among similar 
columns of only relevant tables we would need cumbersome 
four-variable potentials. Instead, we chose a representation 
with only column variables (that we index as tc) but aug- 
mented the label space of column variables with a label nr 
to indicate that the column is part of an irrelevant table. 
Thus, a column variable can take one of q + 2 labels from 
the set y = {1, . . . , q} U {na, nr}. 

3.2 Node Potentials 

Node potentials 9(tc, £) have to be defined for each column 
variable tc and each label I £ y. Their representation as a 
weighted sum of features provides a flexible mechanism to 
incorporate a wide variety of clues. We define five features 
that captures a diverse set of clues including the similarity 
between the query keywords and header, context, and body 
of web tables, and affinity between query keywords and con- 
tents of tables. These are defined in Sections 3.2.1 to 3.2.4. 

3.2.1 Matching Query Keywords to Table Columns 

The most important signal in mapping a web table col- 
umn (tc) to a query column I is the similarity between the 
query keywords Qt and the text in table t's context (if any) 
and text in c's header which could be absent or span over 
one or more rows. A standard method in IR to assign 
scores to multi-field documents is to compute the similar- 
ity of each field separately to the query keywords and take 
their weighted sum. Adapting to our case, we could sepa- 
rately measure the similarity of Qe to (1) the header tokens 
in (tc) and (2) the context of t. We next illustrate the vari- 
ous limitations of this method. 

First, the header text might contain only a part of the 
query keywords and the other part might appear in the con- 
text. For example, one query in our workload was "Nobel 
prize winner" . This retrieved several tables where the word 
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"Nobel prize" appears in the context and only "winner" ap- 
pears in the header. If we separately measure the similarity 
of Qe to (te)'s header and t's context, we will get low simi- 
larity from the header, and even if the similarity with con- 
text is high it does not help us decide which of the columns 
specifically matches Qe. 

Second, matches with tables having multiple header rows 
is not well defined. Should the header tokens be concate- 
nated and then matched to Qe, or should we compute simi- 
larity only with a single best matching row? Concatenation 
makes sense when a true header is split across multiple rows 
as in column 3 of Table 1 in Figure 1. However, concate- 
nation is bad when some header rows are wrong, or when a 
second header row presents irrelevant details. For example, 
in table 2 of Figure 1, the words "chronological order" in 
the second header row should not be used to diminish the 
similarity of the first column of Table 2 to Q3. In this case, 
going with the single-best option is better. 

Third, in some cases the evidence might come from a ta- 
ble's content. For example, a query string Qe in our dataset 
about "Black metal bands" matched a three column web ta- 
ble with respective headers "Band name | Country | Genre" 
and no context. The table contains names of many band 
types, of which "Black metal" is one. The only way to find 
this table as relevant is to exploit the frequent occurrence 
of the words "Black metal" in the third column in mapping 
the first column to Qe- 

Finally, in some cases the header text of some other col- 
umn c' of t might be needed to match column c to Qe. 
For example, another query in our dataset was about "dog 
breeds" which matched a web table with several columns 
two of which had headers "dog" and "breed". The "dog" 
column contained dog names, no where else was the token 
"dog" present in the table. 

We present a new similarity measure that removes all the 
above limitations in one unified function. Instead of sepa- 
rately matching whole of Qe to each field, we compute simi- 
larity via a two-part segmentation of Qe: one part is used to 
pin Qe to a specific header of t and the other part is used to 
gather relevance support from other parts of the table. Let 
qi, . . . ,q m be the sequence of tokens in Qe. Let h denote the 
number of header rows in t and let H rc denote the tokens in 
header row r of column c. We measure the similarity of Qe 
to a column c of web table t by segmenting Qe into two parts: 
a prefix P — qi, . . . ,q k and a suffix S = qk+i, ■ ■ ■ ,q m . One 
of the parts, prefix or suffix, is matched to a header row r 
of c, and the other part to portions of t outside the header, 
including the title of t, the context of t, frequent content 
in some column of t, other headers in column c (H r / C for 
r 7^ r), and other column headers of t in row r (i.e. H rc i for 
c' 7^ c). Let inSim(P, H rc ) denote the first similarity, that 
is, the score of matching the prefix P (or suffix S) to the 
header H rc and outSim(S', t, r, c) denote the second score, 
that is, the score of matching the remaining part (suffix S 
or prefix P) with the rest of the table. The segmented simi- 
larity SegSim(Q^, tc) is computed as the maximum weighted 
sum of scores over all possible values of r and all possible 
query segmentations. That is, 

||P|| 2 

SegSim(Q J .,te) = max max inSim(P, H rc ) 

r=i...h p,s s.t. \\Qe\\ 2 

PS=Q e VSP=Q e 
115112 

+ j L'LoutSim(5,t,r,c) (1) 



where ||P|| denotes the L2-norm of the TF-IDF vector over 
the tokens in P. The inSim(P, H rc ) similarity is the TF-IDF 
weighted cosine similarity between the token sequence P and 
H rc . The out Sim similarity is more challenging because we 
need to account for five sources of matches in the table: 
t's title and context, other header rows in column c, other 
header columns in r, and frequent body content tokens. For 
ease of notation we call these parts as T, C, H c , i/ r ,and B 
respectively. Matches in these different parts have different 
degrees of reliability. We characterize this reliability with 
a probability parameter pi for i 6 {T, C, H c , H r , B}. We 
calculated these empirically on our workload in Section 5 as 
follows: for each part i £ {T, C, H c , H r , B} of all Qe and 
relevant t, reliability Pi of part i is the fraction of correctly 
matched columns from all columns c with positive inSim 
and positive match with i. These values turned out to be 
(1.0, 0.9, 0.5, 1.0, 0.8) respectively for {T, C, H c , H r , B}. The 
score of a token is the soft-max of the reliability over the 
parts with which it matches. Finally, the outSim(S', t, r, c) 
value is computed as the sum of the soft-maxed match reli- 
ability of each term w weighted by the TF-IDF score TI(w) 
of the term as follows: 

outSim(S,t,r,c) = £ (1- ]J (1- Pi )) 

w€S ' ' ie{T,C,H r ,H c ,B} 
w Epart(i) 

The SegSim similarity has several nice properties. First, 
for multi-row headers it overcomes the shortcomings of the 
two extremes of full concatenation and single best. When 
the header has no spurious tokens outside the r th row, it 
reduces to the desired option of cosine similarity on fully 
concatenated headers. For the other extreme when all but 
one header is relevant, it reduces to the single best option. 
Second, when a token matches a table in multiple parts, the 
outSim score of the table increases but the influence of each 
additional match decays exponentially. Third, by requiring 
that the column header have a non-zero match with either 
the prefix or suffix of Qe, we ensure that table-level matches 
count only for specific columns rather than all columns of a 
table. 

3.2.2 Query Fraction Matched 

Another useful feature for defining node potentials is the 
fraction of query tokens that match the column header and 
other parts of the table. For the same value of SegSim sim- 
ilarity, we should assign higher score to the case where we 
have covered all the query terms, over the case where we 
have covered only a subset. We add a second feature called 
Cover(Qe,tc) that differs from SegSim in Equation 1 in only 
the definition of inSim(P, H rc ) — instead of cosine similarity 
we measure the weighted fraction of P's tokens that appear 
in H rc ( = p!p- J2 w ePnH rc: TI ( W ) 2 )- 

3.2.3 Affinity of Query Keywords to Column Content 

Instead of depending solely on the Web tables retrieved 
for a query, we could define features over the entire corpus 
D. One potential signal is the co-occurrence of keywords in 
Qe with the contents of a table column. Such signals have 
been used in [2] for ranking tables for topic queries, and in 
[12] for collecting class instance lists on web queries, and 
in other NLP tasks [20]. Adapting it to our case of query 
keywords Qe and contents of column (tc) we define a PMI 2 



912 



|H(Q,?)nB(Cell(*,r, e))| 2 
|H(Q £ )||B(CeU(t,r, C))| 



score as in [2] as 
PMI2(Q ^ C) = #Rc^) £ 

V ' rgRows(t) 

where H(Qe) denotes the tables in D that contain keywords 
in their header or context, and B(Cell(i, r, c)) denotes 
the tables that match the words in cell (r, c) of table t in 
their content. 

3.2.4 Table Relevance 

Our final feature for node potentials captures the overall 
relevance of a table to the union of query keywords. We 
define this as follows: 

R(Q, t) = -clip( maxCover(<3f , tc), min(g, 1.5)) (2) 
y £ei...q 

where the function clip(a, 6) is when a < b and a otherwise. 
Intuitively, this measures the total fraction of query words 
that match a table's header and context provided the match 
fraction is greater than 1 for single column queries, and 1.5 
for all other queries. 

Node Potential. Our final node potential 9(tc,£) = 

wiSegSim(<5^ , tc) + tt>2Cover(Q<, tc) 

„ , if 1 < £ < q 

+w 3 PMI 2 (Q i ,tc)+w 5 ~ 

min(q,n t ) _ r> ( n ,\\ -co 1 ' ' 



U>4- 




if I = nr 
if £ = na. 



where n t is the number of columns in table t. The values 
wi, . . . ,W4 capture the relative importance of different fea- 
tures and wg is a bias term which is negative and serves the 
purpose of disallowing maps to a query column based on 
very small similarity values. 

3.3 Edge Potentials 

Our edge potentials 8(tc,£,t'c' ,£') are defined over pairs 
of columns tc and t'c' and are used to capture the similarity 
in the content of columns across tables. A popular choice for 
edge potentials is the positive POTTS potential that assigns 
a positive score when the two variables take the same label, 
and is zero otherwise. The strength of the edge is tied to 
the similarity between the columns and is defined as: 
9(tc,£,t'c',£') = w e 8im(tc,t'c')[e = £'} Vtc,t'c s.t. t=£t' 
where [C] is 1 when C is true and otherwise, and w e is 
the weight of this edge feature. This potential fared poorly 
because a large fraction of web tables are irrelevant, yet 
some of their columns are similar to columns of relevant 
tables. These relevant columns get pulled toward the irrel- 
evant label because of the reward accrued from the many 
edges with irrelevant columns. We fix this problem by set- 
ting 6(tc,£,t' c' ,£') to zero when I = £' = nr. However, this 
fix causes the reverse problem: irrelevant tables start get- 
ting marked as relevant. When irrelevant tables are large 
in number and highly similar to each other, the edge po- 
tential terms overshadow the node potentials. We therefore 
designed a custom edge potential that differs from the above 
potential in three ways: 

Normalize Similarity. We first normalize the similarity 
measure so that the sum of the similarity of a column to 
other neighbors is bounded at one. For each column (tc), we 
normalize its similarity as nsimftc, t'c') — — snri(te,t c ) ^ — 

v ' A+£f,c slm (* c >*' 5 ) 

The smoothing constant A makes sure that the normaliza- 
tion does not inadequately boost similarity of a column that 



is weakly similar to only a few other columns. We choose 
A = 0.3 and ignore neighbors with unnormalized similarity 
below 0.1. The nsim values are asymmetric and we will see 
shortly how these are used to create symmetric edge poten- 
tials. 

Include Column Confidence. Second, we only add an edge 
potential across tables where at least one of the two columns 
is confident about its labeling independent of other tables. 
We measure confidence of a column's labeling using its node 
potentials to get a probability distribution Pr(£\tc) as de- 
scribed in Section 4.2. A column is confident only if Pr(£\tc) 
is large for some £ £ [1 . . . q]. We used a threshold of 0.6. 

Max-matching Edges. Third, for a table pair t, t', we con- 
nect each column c in the pair to at most one column in the 
other table instead of all the columns with which c might be 
similar. This provides more robust transfer of labels when 
columns within a table are similar to each other. We 
achieve this by first finding the best one-one matching be- 
tween the columns of two tables based on a weighted sum 
of their content and header similarity. We only add an edge 
between columns in the matching, and not between all pos- 
sible pairs of similar columns of two tables. 
The final edge potential 6(tc,£,t'c' ,£') = 

w e l£ = £' A £ ^ nr] (nsim(tc, t'c') [max Pr(y\t'c) > 0.6] 

j/^nr 

+ nsim(t'c',tc)[maxPr(y|tc) > 0.6]) (4) 
y^nr 

3.4 Table-level Potentials 

We define a set of hard constraints to control the consis- 
tency of the labels assigned to a web table. Each of these 
are defined over the labels of all the columns within a table 
and disallow inconsistent labelings by taking large negative 
values for inconsistent labels. Unlike for the node and edge 
potentials, there are no parameters to train for these poten- 
tials. We include the following four hard constraints. 

The mutex Constraint. This constraint requires that only 
one column in a table t be mapped to a query column. That 
is, 

<M£i, . . . ,£ nt ) = lY^ltj = ej < 1,W = 1 . . . qf^ (5) 

3 

where [CJ'loo takes a value zero when condition C is true 
and — oo otherwise. 

The ALL-Irr Constraint. This constraint ensures that if 
one column in a table t is assigned a label nr then all columns 
of that table must be assigned nr. This constraint helps to 
make consistent table-level relevance decisions from column- 
level decisions. That is, 

4> b {l u ...,£ nt ) = iJ^Vj = nr] G {0,^}]°^ (6) 

3 

The must-match Constraint. This constraint requires that 
every relevant table must contain the first query column. 

Mti, ...,£ nt ) = lJ2l £ 3 G {l.nr}] > 0]°^ (7) 



The min-match Constraint. Every relevant table must con- 
tain at least m of the q query columns. We chose m as two 
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for all queries with q > 2. 

Mil, ■ ■ • , *n t ) = G I 1 • • • 1, nr]] > m] .^ (8) 

i 

Overall Objective. Thus we have reformulated the Col- 
umn mapping task as the task of finding labels yt c £ y to 
each table, column pair (tc) so as to maximize the sum of 
node potentials (Equation 3) , edge potentials (Equation 4) 
and table consistency potentials (Equations 5 to 8). We 
denote the set of all labels as y, and the overall objective is: 

max ^6>(tc, ytc) + ^2 8(tc,ytc,t'c ,y t > c ') 

^tc ^ tc t'^t.c' 

node odgc 

+ '52<t>c{ytc 1 T--,ytc nt ) (9) 

Ce{a,b,c,d) t 

V v ' 

constraints 

The above objective has six parameters wi, . . . , W5, w e . We 
use a training phase with a separate labeled dataset to find 
the values of these parameters so that the error of the high- 
est scoring mapping is minimized. Since we had only six 
parameters, we were able to find the best values through 
exhaustive enumeration. More sophisticated training meth- 
ods, say based on max-margin structured learning [10] per- 
form well only under exact inference which is not possible 
for our objective because it is NP-hard. This can be proved 
using a reduction of the well-known NP hard problem of 
metric labeling [6] to our problem. 

4. INFERENCE ALGORITHMS 

Since the problem of finding the optimal column labels 
as per objective 9 is NP-hard, we present approximation 
algorithms in this section. We develop our approximation 
algorithm in stages. First in Section 4.1, we show that if 
the edge potential terms are absent, we can find the optimal 
column labels in polynomial time via a novel reduction to a 
bipartite matching problem. Next, we show two approaches 
to approximating the overall objective: table centric (Sec- 
tion 4.2) and edge-centric (Section 4.3) based on whether 
they give more importance to the table-specific potentials 
or across table edge potentials. 

4.1 Table Independent Inference 

We show how to optimally solve objective 9 without the 
edge potentials. The optimal labeling of the different tables 
now get decoupled. For each table t we solve a generalized 
maximum matching problem in a bipartite graph Gt cre- 
ated as follows. The n t columns of t forming the left nodes 
of G t and labels 1, . . . , q, na forming the right nodes. The 
weight of an edge between a left node c and a right node 
£ is 0(tc, £) + Mi where Me is a large positive constant for 
£ = 1, and zero otherwise. In addition, each node v is as- 
signed a capacity c v that is one for all nodes except the right 
node na for which is c v = max(0, n t — m) where m is the 
minimum number of columns required to be matched in the 
min-match constraint. Find a matching, that is a subset 
S of the edges in Gt, such that each node v has no more 
than Cv adjacent edges in S and the sum of edge weights in 
S is maximized. Such a matching can be found in O(rijg) 
time using the well-known reduction to the min-cost max- 
flow problem (summarized in Section 4.2.1). The edges in 
the optimal S give the highest scoring column mapping y t 



of t under the constraint that t is relevant. We contrast the 
score of yt with the score of labeling all columns of t as nr 
and pick the higher of the two as the final labeling. 

We skip a formal proof of correctness and instead state 
some intuitive claims as justification. The capacity one con- 
straint on all right side nodes in 1, . . . ,q ensures that the 
mutex condition is met. The large additive constant Mi 
ensures that the highest scoring labeling must assign at least 
one column to label 1 thereby meeting the must-match con- 
straint. The constraint that no more than n t — m columns 
of t can be assigned label na ensures that at least m of them 
must be mapped to a query column, thus satisfying the min- 
im ATCH constraint. 

4.2 Collective Inference: Table-centric 

This collective inference algorithm is centered on opti- 
mal inference at the table-level, and uses edges to influ- 
ence table-level decisions very cautiously. The algorithm 
works in three stages. First, independently for each table 
t find the highest possible score Htc(£) of objective 9 (ig- 
noring edge potentials) when each column c is constrained 
to take each label I. Normalize these to define a distri- 
bution as Vtc(£) = cxp ^/ c ^j» ) xx . Second, for each table 
column tc collect messages from each of its neighbors as 
msg(tc,£) = J2 t 'c' enbr (tc) w e nsim(tc,t'c')p t , c ,(£). Finally, 
independently for each t invoke the algorithm in Section 4.1 
with node potential modified as max(msg(fc, £), 6(tc, £)). 

The only difficult part of the above algorithm is the effi- 
cient computation of the /-itc(£) values. The direct method of 
computing fj,tc(£) by repeated calls to the maximum match- 
ing algorithm of Section 4.1 for each (c,£) pair is expensive. 
We propose a method of making this efficient by simulta- 
neously computing all the values. Before we present our 
algorithm we need to briefly recap the popular flow-based 
algorithm for maximum matching in bipartite graphs. Read- 
ers familiar with this topic can skip Sections 4.2.1 and 4.2.2 
and jump to Section 4.2.3. 

4.2.1 Recap: Maximum Weight Bipartite Matching 

Given a weighted bipartite graph B where each node u 
has a positive capacity c u and each edge (u, v) has weight 
w(u, v), find subset S of edges of B such that £\ - gS w(u, v) 
is maximized and J2 V :(u,v)eSv(v,u)es 1 - c "- A well-known 
algorithm [13] to solve this problem is by finding the max- 
imum flow on a weighted directed graph G created from B 
as follows. First, we need to ensure that the total capacity 
of nodes on the left side of B is equal to the total capacity 
on the right. If this condition is violated we add a dummy 
node d on the deficient side with the deficient capacity. Call 
the new set of left nodes L and right nodes R. G is seeded 
with these nodes and two extra nodes: a source node s, and 
a sink node t. Second, add edges from s to each u £ L with 
cost(s, u) = and cap(s, u) = c u , from each u G R to t with 
cost(ii, t) = and cap(u,t) = c u , and from each u £ L to 
each v £ R with cap(u, v) = min(c u , c v ) and cost(u, v) = if 
either u or v is the dummy node d, else cost(u, v) = —w(u, v) 
Invoke the algorithm described in the next section to find 
the minimum cost maximum flow on G. The final matching 
consists of all edges from L to R with positive flow. 

4.2.2 Recap: The Max-flow Algorithm 

The input to the algorithm is a directed graph G with 
edge costs cost(u, v) and cap(u, v) and two special vertices 
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Input: Gt = Weighted bipartite graph (Section 4.2.3) 
G*f,m*, Opt = Residual graph and optimal matching ob- 
tained by algorithm in Section 4.2.1 
for I G [1 . . . q] U {na} do 

Find shortest distances d(£, .) in Gf from I to all nodes. 

Htc{t) = Opt — d(£, c) — cost(c,£), Vc = l,...,m 
end for 

Attc(nr) = J2 c e ( tc ' m ) 

Figure 3: Finding all max-marginals in Gt- 

s, t. The goal is to push the maximum flow from s to t such 
that the flow f(u, v) along each edge is < cap(tt, v) and the 
total cost „\ cost(u, v)f(u, v) is minimized. The Max- 
flow algorithm [13] starts with associating each edge (u, v) 
with f(u, v) = 0. Define with each edge (u, v) a residual 
capacity res(it, v) = cap(tt, v) — f(u, v) that measures how 
much extra flow can be pushed along that edge without vio- 
lating the capacity constraint. For every (u,v) in the graph, 
there is an implicit reverse edge (v,u) with cap(v,w) = 0, 
f(v,u) = —f(u,v), and cost(w,w) = — cost(w, v). This im- 
plies that a flow along (u, v) can be reversed by pushing a 
flow in the reverse direction from v to u. Every set of flow 
values / defines a new residual graph Gf out of G com- 
prising only of edges with positive residual capacity. The 
algorithm proceeds by repeatedly finding the shortest cost 
path P from s to t in Gf, pushing the maximum possible 
flow f(P) along P in G, thereby modifying / and getting a 
different residual graph from G. The algorithm terminates 
when no path can be found from s to t in Gf. 

4.2.3 Finding Max-marginals 

We now show how to efficiently compute Htc{£) the max- 
imum possible score of Equation 9 (without the edge po- 
tentials) under the constraint that column c takes label I. 
These are called max-marginals. When computing the max- 
marginals, it is important to exclude the must-match and 
min-match constraints because otherwise the relative mag- 
nitude of /i for different I can be distorted when some high 
scoring labeling violates the MUST-MATCH or min-match 
constraint. Thus, for each t Htc(£) = 

max „ y^0( te '>^ c ')+<Mytc 1 , . . . ,ytc n , )+<t>b(ytc!, ■ ■ -,ytc n .) 

yytc=L*—f 

(10) 

We create a bipartite graph Gt as in Section 4.1 with the 
only difference that the addition Mi to edge weights is left 
out since the must-match constraint is dropped and the 
capacity of the right node na is nt since the min-match 
constraint is dropped. We then find the maximum weight 
bipartite matching (recalled in Section 4.2.1) on Gt- Let 
m* denote the node to which c is matched in the optimal 
matching m* , Opt denote the sum of edge weights in the 
optimum matching, and let Gf be the final residual graph 
from the Max-flow algorithm (recalled in Section 4.2.2). For 
each c, and for each £ 7^ m*, we need to force a matching 
(c, I) and compute the rest of the matching. Since the ca- 
pacities of the two sides were balanced in m* , we need to 
remove unit flow from m\ to I and from c to m* and add 
flow from c to I. The smallest cost method to make this 
change is by reverting the flow along the shortest path P 
from I to c in G*f. Let d(£, c) be the cost of this path. The 
maximum weight matching under the (c,£) constraint can 
be shown to be equal to Opt — d(£, c) — cost(c, I). Thus, in 
order to find the maximum matching under all possible pair 



constraints, we need to find single source shortest paths from 
each right node £ in the residual graph. Since edge costs can 
be negative, we need to use the Bellman Ford shortest path 
algorithm to find these shortest paths. Each invocation of 
Bellman Ford's algorithm on a graph with N nodes and E 
edges takes O(NE) time, and we do this q + 1 times. A 
pseudo code of the algorithm appears in Figure 3. 

4.3 Collective Inference: Edge-centric 

The table-centric algorithm gives more importance to table- 
level constraints than to edge potentials. We wanted to 
compare this approach with an opposite approach of giving 
edge potentials central importance. In these approaches, 
called edge-centric approaches, the table-level constraints 
are either expressed as edge potentials, or are handled in 
a post-processing phase. Another reason for exploring these 
approaches is that there are many existing inference algo- 
rithms that work on graphical models with only node and 
edge potentials [10, 1, 11, 18]. These include message pass- 
ing algorithms like belief propagation [10], TRWS [11], and 
MPLP [18], and graph cut-based algorithms like a-expan- 
sion algorithm [1]. Of these, the a-expansion algorithm is 
known to be very efficient [19] and was the best performing 
of the edge-centric algorithms in our experiments. In this 
section we go over how we adapted the algorithm to handle 
the table constraints. 

The a-expansion algorithm only works for edge potentials 
whose negated forms behave like a metric. That is, for each 
pairwise edge potential function <f> it requires that (f>(£i,£j) — 
(f>(£j,£i), and <p(£i,£j) — <j>(£j,£ k ) < <f>(£i,£ k ). It is easy to 
verify that our edge potentials in Equation 4 satisfy this 
property. 

Also, of the four constraints, the ALL-Irr constraint in 
Equation 6 can be written as the sum of metric edge poten- 
tials defined over each pair of columns, that is, <j>b(£i , ■ ■ ■ , £n t ) 
can be expressed as Y^j=i Eti <t>B(£i,tj) where 

4>B{£i,£j) = Wi = nr] + [tj = nr] ^ if... (11) 
We first present an overview of the a-expansion algorithm 
and then show how we modify it to incorporate the remain- 
ing three of the table constraints. 

The a-expansion Algorithm [1]. The algorithm main- 
tains a labeling y of the variables at all times, and in each 
step improves it through local moves. Initially y can be ar- 
bitrary — say, all vertices assigned label na. Next, for each 
label a (a € y) an a-expansion move switches the labeling 
of an optimal set of vertices from their current label in y 
to a. This is repeated in a round over the q + 2 labels un- 
til y remains unchanged in a complete round. For graphs 
whose edge potentials satisfy the metric condition, an opti- 
mal a-expansion move is obtained from the minimum cut in 
a weighted directed graph G a , y created as follows: To the 
original graphical model H add two special vertices s, t. Add 
an edge from s to each v G V(H) and from every v G V(H) 
to t. Edge weights are derived from the potentials of H , the 
current a and current labeling y such that the minimum 
s-t cut gives the optimal subset of variables S on the t side 
whose labels in y should be switched to a. We refer the 
reader to [1] for details on how to set these edge weights. 

We next present our modification to the above algorithm 
to handle the remaining constraints. We will shortly present 
a novel modification for handling the mutex constraint. The 
must-match and min-match constraints cannot be easily 
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Input: G, s,t and disjoint vertex groups Vi, . . . Vt- 
Gf,S = Residual graph and t side vertices after applying 
Max-flow to find the minimum s-t cut on G 
while 3 k s.t. |S n 14| > 1 do 
for all i s.t. \S H Vi\ > 1 do 
for v € {U, = S n Vt) do 

f(v, Vi) — maximum additional flow in G if for each 
u G Ui — {«}, cap(s,it) is increased to oo. 
end for 
end for 

Pick i*,v* — axgmm i v f(v,Vi), set cap(s, u) = oo for 
each u G Ui* — {v*} and modify flow in Gf 
S — vertices on the t side of the cut 

end while 

Output: S 

Figure 4: Constrained minimum s-t cut algorithm. 

integrated with the above algorithm. So, we handle these 
as a post-processing step as follows: In the output label- 
ing if any table t violates the must-match or min-match 
constraint we greedily fix its labels by invoking the table 
independent algorithm (Section 4.1) on t. 

To handle the mutex constraint, we modify the a-ex- 
pansion move for a £ [1 ... q] so that at most one column 
variable in each table switches to a. We achieve this by 
solving the following constrained s-t cut problem on G a , y 
for a G [1 . . . q] ■ 

The Constrained Minimum s-t cut Problem. Given a 
positive weighted directed graph G — (V, E, w) whose vertex 
set can be partitioned into disjoint subsets Vi, . . . Vt and two 
special vertices s, t, find the minimum s-t cut such that at 
most one vertex in each Vi is on the t side of the cut. In our 
case the groups Vi correspond to columns of the same table. 
This problem is NP-hard unlike the unconstrained minimum 
s-t cut problem which can be solved in 0(V ElogV) time. 
We are able to provide a factor of two approximation. In this 
paper we skip proofs. Instead we present an approximation 
that performed well on our problem. 

A popular method for solving the minimum s-t cut prob- 
lem is based on a reduction to the Max-flow problem that 
we reviewed in Section 4.2.2. We first show this reduction 
and then present our modification. 

Minimum s-t cut via Max-flow [13]. The Max-flow algo- 
rithm is invoked on G with a uniform cost of one on all 
edges and cap(«, v) — w uv where w uv denotes the weight of 
an edge in G. Let Gf be the residual graph when the Max- 
flow algorithm terminates. The minimum cut C comprises 
of edges for which f(u, v) = cap(w, v). 

Our Modifications for Constrained Cuts. Let S be the 

set of vertices that remain connected to t after the edges 
C in the unconstrained cut are removed. C is optimal for 
the constrained case if for each vertex group Vi at most one 
member appears in S. Else we repeat the following steps 
until all constraints are satisfied. Find a violated group Vi 
for which \SC\ V t \ > 1. Let U t = Sn V t . We force all but one 
vertex v* in Ui to the s side of the cut. Such a v* is picked as 
that vertex for which we can push the minimum additional 
flow in the residual graph Gf with all Ui — {v*} connected 
to s via infinite capacity edges. The final algorithm appears 
in Figure 4. 



5. EXPERIMENTS 

Creating a useful query set for this project is challenging 
in an academic setting without access to a user base. The 
best we found was the list of topic queries in [2] collected 
via a Amazon Mechanical Turk service. They represent an 
unbiased workload because these are suggested by a diverse 
set of web users, and not hand picked by researchers. The 
query set comprised of 51 queries spanning topics such as, 
"north american mountains", "professional wrestlers", and 
"world's tallest buildings". We call this the AMT query set. 
However, since our task required multi-column queries, we 
converted each topic query in AMT to a multi column query 
as follows: For each topic query in AMT, we inspected the 
top-10 pages from a Google search and found between one 
and three prominent attributes to be attached with each 
query. The final list of multi-column queries is present in 
Table 1 1 . The original topic queries can be found in Table 
1 of [2]. We augmented the AMT queries with twelve more 
queries that we collected internally from Wikipedia tables. 
This gave us a total of 59 queries of which 5 were single- 
column queries, 37 were two column queries, and 17 were 
three column queries. We list the 59 queries in Table 1. 

We run these queries on a web crawl of 500 million pages 
from which we found 25 million data tables. For evaluat- 
ing the accuracy of the Column mapping task we posed 
each query in our query-set on WWT and for each table 
returned by the index probe we collected the correct label- 
ing by manually labeling each of the 1906 web tables over 
the 59 queries. This gave us the set of ground truth labels. 
Each labeling was reviewed by two human labelers to avoid 
any human errors. In Table 1 we list the total number of 
source web tables that were returned by the two-phase in- 
dex probe in WWT and the number of relevant web tables. 
Overall for a query, we found between and 68 web tables 
with an average of 32.29. The fraction of relevant matches 
varied between and 100% and on average only 60% of web 
tables were relevant. 

We evaluate different methods via the well-known Fl error 
measure. This is calculated as follows: Let y* denote the 
vector of all column labelings in the ground truth. The Fl 
error of any other labeling y produced by a method is 

, = vtc a vtc e [l . . . gj] x loo 

errors, y ; £ tc [j, tc G [1 •■■?]] + J2 tc htc €[l...q]j 

We compare the accuracy of the Column mapping task on 
the following methods 

1. Basic: A baseline method that is described in the be- 
ginning of Section 3. 

2. NbrText: Basic augmented to include text of similar 
columns to measure similarity sim(Q^,te) as 

max(TI(Q^, tc), maxsim(te, t'c)TI(Qe , t'c)) 

t' c' 

3. PMI 2 : The basic method augmented with only PMI 2 
scores defined in Section 3.2.3 

4. WWT: Our graphical model-based approach described 
in Section 3 with the table-centric algorithm. 

For reporting results, we partitioned the queries into two 
sets: an "easy" set for which all methods were within 0.5% 
of each other. These include queries for which either the 
column labeling task is trivial, or impossible to solve with 

*We skipped 4 queries from AMT, because we could not 
interpret the appropriate column queries for them. 
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Source Tables 


pain killers | company 


1 


1 


■Single Column Queries 


Total 


Relevant 


pga players | total score 




29 


dog breed 


68 


66 


pre-production electric vehicle | release date 


3 





kings ot alrica 


26 





running shoes model | company 


1 1 


5 


phases of moon 


56 


17 


science discoveries | discoverers 


41 


37 


prime ministers of england 


35 


3 


university | motto 


7 


5 


professional wrestlers 


52 


52 


us cities | population 


34 


32 


Two Column Queries 


Total 


Relevant 


us pizza store | annual sales 


35 


1 


2008 bcijing Olympic events | winners 


29 





usa states | population 


41 


37 


2008 Olympic gold medal winners | sports/event 


26 





used cellphones | price 


29 





australian cities | area 


30 


4 


video games | company 


30 


28 


banks | interest rates 


51 


34 


wimblcdon champions | year 


38 


24 


black metal bands | country 


39 


19 


world tallest buildings | height 


51 


12 


books in United States | author 


6 


2 


Three Column Queries 


Total Relevant 


car accidents location | year 


46 


8 


academy award category | winner | year 


56 


22 


clothing sizes | symbols 


20 





bittorrcnt clients | license | cost 








composition of the sun | percentage 


50 


12 


chemical element | atomic number | atomic weight 


33 


30 


country | currency 


56 


53 


company | stock ticker | price 


53 


53 


country | daily fuel consumption 


38 


14 


educational exchange discipline in US | number of 


13 


2 


country | gdp 


58 


56 


students | year 






country | population 


58 


55 


fast cars | company | top speed 


34 


29 


country | us dollar exchange rate 


52 


43 


food | fat | protein 


47 


43 


fifa worlds cup winners | year 


49 


9 


ipod models | release date | price 


44 


16 


Golden Globe award winners | year 


23 


19 


name of explorers | nationality | areas explored 


1!) 


13 


Ibancz guitar series | models 


21 


3 


NBA Match | date | winner 


44 


34 


Internet domains | entity 


10 


4 


new Jcdi Order novels | authors | year 


25 


24 


James Bond films | year 


16 


11 


Nobel prize winners | field | year 


12 


10 


Microsoft Windows products | release date 


25 


12 


Olympus digital SLR Models | resolution | price 


11 


3 


MLB world scries winners | year 


13 


3 


president | library name | location 


8 


1 


movies | gross collection 


57 


57 


religion | number of followers | country of origin 


37 


32 


name of parrot | binomial name 


11 


8 


Star Trek novels | authors | release date 


8 


8 


north american mountains | height 


47 


28 


us states | capitals | largest cities 


32 


30 



25% 
15% 
5% 
-5% 
-15% 
-25% 



J J II I 



1 2 3 4 5 6 
IPMI NbrText BWWT 

Query Group 



Table 1: 

Grp Basic 

1 98.8% 

2 72.7% 

3 54.1% 

4 45.1% 

5 31.8% 

6 19.7% 

7 4.2% 



Figure 5: Error reduction relative to Basic in the 
column map performance of WWT, NbrText and 
PMI 2 . The table alongside shows the error of Basic. 

our set of clues. One-third of our queries were "easy" and 
all methods had an error close to 22% on these queries. 
We compare different methods on the remaining two-third 
"hard" queries. For ease of plotting, we divided the hard 
queries into seven groups by binning on the error of the Ba- 
sic method. 

We first compare the above four methods. We then an- 
alyze the impact of the segmented similarity measure and 
finally compare different collective inference algorithms. 

5.1 Overall Comparison 

In Figure 5 we show the reduction in error relative to 
Basic of the three methods: PMI 2 , NbrText, and WWT 
on the seven query groups. Overall, WWT incurred an er- 
ror of 30.3% in contrast to 34.7% for Basic and PMI 2 , and 
34.2% for NbrText. For the first query group (comprising of 
seven queries), all three methods failed to label all but one 
as relevant whereas WWT got 22% of them. Even for the 
last group where Basic already achieved a low error of 4.2, 
WWT reduced that by 10% to 3.6%. The NbrText method 
reduces error for some queries but incurs an increase in er- 



Query set. 

ror in several others. The method is specifically bad when 
the columns within a table overlap. An example is the last 
query in Table 1. The NbrText method suffers because the 
columns containing capitals and largest cities are overlap- 
ping, causing the wrong header text to be imported in a 
column. 

Surprisingly, we did not get any accuracy boost overall 
with the PMI 2 score unlike what is reported in [2]. Even if 
we restrict to single column queries in our set, WWT's error 
is 28.9% compared to 33.3% of PMI 2 used in [2]. One reason 
is that the PMI scores are noisy — while they reduce error in 
seven of the queries, they also caused an increase in an equal 
number mostly due to adding irrelevant columns. Other 
studies [20] have attributed such behaviour to the undue 
importance the PMI score gives to low frequency words due 
to their presence in the denominator. Also, the PMI 2 score 
is expensive to compute. The average time for a query was 
6.3 seconds for Basic, 40 seconds for PMI 2 and 6.7 seconds 
for WWT, which does not use the PMI 2 scores by default. 

We show the impact of improved column mapping on over- 
all search performance in Figure 6 where the y-axis is the 
error comparing the rows of the consolidated answer table 
of true column mapping and the consolidated answer table 
of any other mapping. We observe that WWT yields signifi- 
cant improvements in the accuracy of the final answer in all 
cases. 

In Figure 7 we show the total running time of WWT bro- 
ken into the time taken to probe the index (stored on disk) 
in each of the two stages, the time to read and parse the raw 
tables from disk, the time for column mapping, and the time 
to consolidate (and dedup) the rows of relevant tables. The 
running time varies from 1.5 to 14 seconds with an average 
of 6.7 seconds. The key factors affecting running time are 
whether a second index probe was used for the query, the 
size of the raw tables and the number of relevant rows that 
are consolidated into the final answer. The time for column 
mapping is a negligible fraction of the total time. 
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Figure 6: Answer qual- Figure 7: Running 
ity. time. 




0% 20% 40% 60% 80% 100% 

Error: Unsegmented similarity 

Figure 8: Comparing segmented similarity with 
standard unsegmented IR similarity. 

5.2 Evaluating Segmented Similarity 

In this section we evaluate our segmented similarity mea- 
sure of Section 3.2.1 by comparing it against a model which 
is identical in all respects except using plain cosine simi- 
larity with the header text instead of SegSim and Cover. 
We call this the unsegmented similarity measure. For each 
measure we retrained the model parameters so that they are 
best tuned for the selected similarity measure. Overall the 
segmented measure reduced error from 33.3% to 30.3%. In 
Figure 8 we show a scatter plot where each point represents 
the error of a query using segmented measure against error 
of the unsegmented measure. We see that for all but three 
of the 32 queries, the segmented approach lies below the 45 
degree line, and in eight cases it reduces by more than 10%. 

5.3 Comparing Collective Inference Methods 

In Table 2 we compare a baseline (in column 2) that 
independently labels each table with the following collec- 
tive inference algorithms: the constrained a-expansion al- 
gorithm, two edge-centric message-passing algorithms viz., 
Belief propagation (BP) and TRWS, and the Table-centric 



Group 


None 


a-exp 


BP 


TRWS 


Table-centric 


1 


76.0 


76.0 


74.5 


76.0 


75.1 


2 


72.4 


72.4 


71.6 


72.4 


63.3 


3 


53.3 


43.4 


61.1 


51.4 


52.3 


4 


43.2 


45.4 


40.4 


41.5 


37.5 


5 


31.2 


26.6 


26.9 


31.3 


27.1 


6 


22.4 


20.1 


20.1 


18.8 


18.0 


7 


4.0 


4.0 


4.2 


4.0 


3.8 


Overall 


33.1 


31.3 


31.5 


32.3 


30.3 



Table 2: Comparing different collective inference al- 
gorithms on Fl error over seven query groups sepa- 
rately and over all queries combined. 



algorithm. The BP and TRWS incorporated the table con- 
straints in the same way as a-expansion for all but the MU- 
tex constraint. The mutex constraints were reduced to 
edge potentials like the ALL-Irr constraint — we skip de- 
tails due to lack of space. 

Overall, the Table-centric algorithm achieves the lowest 
error of 30.3% which is roughly 3% lower than the baseline 
of no collective inference. This is followed by our constrained 
a-expansion algorithm. BP is slightly worse and TRWS is 
the worst. One reason is that BP and TRWS are known not 
to provide good approximations when many edge potentials 
are dissociative (i.e., they prefer connected nodes to take 
different labels). In our graphical model, in expressing the 
mutex constraints as edge potentials we created many disso- 
ciative edges. In contrast, for the a-expansion algorithm we 
handled the mutex constraints separately via a constrained 
graph-cut algorithm. In order to understand the reasons for 
the poorer performance of a-expansion vis-a-vis the table- 
centric algorithm, we considered two explanations. First, 
that the edge potential scores in our objective were not well 
calibrated causing higher scoring labelings to not necessar- 
ily have lower error, and the Table-centric was winning by 
giving lower importance to such potentials. Second, that 
the a-expansion algorithm was solving the objective (with 
all the constraints) poorly. We found the latter to be the 
reason. In most cases where a-expansion lost it returned 
labelings with lower overall scores than the table-centric al- 
gorithm. In terms of running time, the table-centric algo- 
rithm is the fastest followed by a-expansion which is a factor 
of five slower, followed by BP a factor of six slower and fi- 
nally TRWS which is a factor of 30 slower. Thus, the most 
practical option for collective inference is the table-centric 
algorithm both in terms of accuracy and running time. 

6. RELATED WORK 

We discuss four areas of related work: structured Web 
search, Web tables in structured search, keyword queries on 
databases, and schema matching. 

Structured Web search has been a topic of research ever 
since the advent of the Web (see [5] for a survey). The most 
common type of structured search is point answers to a query 
such as: "CEO of IBM"; and most research has focused on 
harnessing document sources to answer them [5] . Our focus 
is on queries whose answer is a single consolidated table. In 
earlier work [9] we developed a query-by-example paradigm 
of extracting such tables from lists on the Web (web ta- 
bles were excluded). Another example is Google squared 2 , 
a commercial product, that interpreted query keywords as 
description of entity types and the answer was a table with 
entity instances and a suggested list of attributes. Technical 
details of the system are not in the public domain. 

The potential of web tables as a source of structured in- 
formation was first highlighted in [4, 3]. In this work, the 
stress was on collecting offline information of various kinds, 
including attribute synonyms, attribute associations, etc. In 
contrast, our goal is to answer ad hoc queries, for which the 
closest related work is [2] . Their system is based on multiple 
user interactions: first in response to a single set of keywords 
a ranked list of tables is retrieved, the user integrates the 
sources into a single table, the user can then choose addi- 
tional attribute columns, and the system fills in the values 

2 http://en. wikipedia.org/wiki/Google_Squared 
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from additional sources. Only the method used for estab- 
lishing table relevance in the first step above is related to our 
goal in this paper. The method they propose for relevance 
ranking is to use the PMI score that we have already shown 
in Section 5.1 to not be as effective for our task of column 
labeling. 

Keyword search on tables is now an established area of re- 
search [21]. Most early work was on clean databases where 
each entity type has a distinct table with well defined col- 
umn names, types, and primary keys. Recently [17] presents 
a probabilistic algorithm for annotating parts of keyword 
queries with table names, attribute names, and selection 
predicates on a set of product catalogs. A related problem 
is tackled in [15] where a keyword query over an Ontology 
is broken into a structured query over the entity, types, and 
relationships in the Ontology. A database of web tables is 
entirely different from such databases: there is huge redun- 
dancy, no well-defined schema, no standard syntax for spec- 
ifying column names, and the scale is orders of magnitude 
higher. 

Our task could be viewed as a schema matching problem 
between the query columns Q and a Web table T. Schema 
matching [16, 8] has traditionally been applied for integrat- 
ing two databases, each of which contains a consistent and 
clean set of tables and the main challenge is in managing 
the complex alignment between the large number of schema 
elements on each side. In contrast, in our case we are match- 
ing a few query columns to a large number of unlinked and 
noisy web tables. This gives rise to a very different problem 
structure — since a small fixed number of query columns 
are matched we can cast this matching task as a column la- 
beling task. The building blocks used for matching a query 
table to any single web tables is similar to those used in 
schema matching but collectively performing such binary 
matchings over several web tables is new to our setting. 

7. CONCLUSION 

We presented the design of a system for getting as answer 
a multi-column table in response to a query specifying a 
set of column keywords. We described the many non-trivial 
steps required in the processing of such queries such as ex- 
tracting headers and context of tables, retrieving tables via a 
two staged index probe, and mapping columns before consol- 
idating them into the answer table. The focus of this paper 
was the column mapping task. Our representation of this 
task as a graphical model allowed us to jointly decide on the 
relevance and mappings of all candidate table columns while 
exploiting a diverse set of clues spanning matches in different 
parts of the table, corpus-wide co-occurrence statistics, and 
content overlap across table columns. We presented a novel 
method of matching query keywords to table columns via a 
two part query segmentation method, and a robust mech- 
anism of exploiting content overlap across table columns. 
Experiments on a realistic query workload and a database 
of 25 million web tables showed a 12% reduction in error rel- 
ative to a baseline method. Future work in the area include 
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